Cross-Domain HAR: Few Shot Transfer Learning for Human Activity Recognition
The ubiquitous availability of smartphones and smartwatches with integrated
inertial measurement units (IMUs) enables straightforward capturing of human
activities. For specific applications of sensor-based human activity recognition (HAR), however, logistical challenges and burgeoning costs make the ground-truth annotation of such data in particular a difficult endeavor,
resulting in limited scale and diversity of datasets. Transfer learning, i.e.,
leveraging publicly available labeled datasets to first learn useful
representations that can then be fine-tuned using limited amounts of labeled
data from a target domain, can alleviate some of the performance issues of
contemporary HAR systems. Yet such approaches can fail when the differences between source and target conditions are too large and/or only a few samples from a target application domain are available, both of which are typical challenges in real-world human activity recognition scenarios. In this paper, we present an approach for economical use of publicly available labeled HAR datasets for
effective transfer learning. We introduce a novel transfer learning framework,
Cross-Domain HAR, which follows the teacher-student self-training paradigm to
more effectively recognize activities with very limited label information. It
bridges conceptual gaps between source and target domains, including sensor
locations and type of activities. Through our extensive experimental evaluation
on a range of benchmark datasets, we demonstrate the effectiveness of our
approach for practically relevant few-shot activity recognition scenarios. We also present a detailed analysis of how the individual components of our framework affect downstream performance.
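As a rough illustration of the teacher-student self-training idea described above (a minimal sketch, not the authors' Cross-Domain HAR implementation), the following Python snippet pseudo-labels unlabeled target-domain windows with a source-trained teacher and trains a student on the confident pseudo-labels plus the few labeled target samples. The feature dimension, confidence threshold, and logistic-regression models are illustrative assumptions.

```python
# Hedged sketch of teacher-student self-training for few-shot transfer.
# All data and model choices below are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pre-extracted window features (e.g., one row per IMU window).
X_source, y_source = rng.normal(size=(2000, 64)), rng.integers(0, 6, 2000)
X_target_unlab = rng.normal(size=(5000, 64))           # unlabeled target domain
X_target_few, y_target_few = rng.normal(size=(30, 64)), rng.integers(0, 6, 30)

# 1) Teacher: trained on the labeled source dataset only.
teacher = LogisticRegression(max_iter=1000).fit(X_source, y_source)

# 2) Pseudo-label the unlabeled target data, keeping only confident predictions.
proba = teacher.predict_proba(X_target_unlab)
confident = proba.max(axis=1) >= 0.9
X_pseudo, y_pseudo = X_target_unlab[confident], proba[confident].argmax(axis=1)

# 3) Student: trained on the confident pseudo-labels plus the few labeled
#    target samples, which is what bridges the source/target gap.
X_student = np.vstack([X_pseudo, X_target_few])
y_student = np.concatenate([y_pseudo, y_target_few])
student = LogisticRegression(max_iter=1000).fit(X_student, y_student)
```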
Assessing the State of Self-Supervised Human Activity Recognition using Wearables
The emergence of self-supervised learning in the field of wearables-based
human activity recognition (HAR) has opened up opportunities to tackle the most
pressing challenges in the field, namely to exploit unlabeled data to derive
reliable recognition systems for scenarios where only small amounts of labeled
training samples can be collected. As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune', has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone the hand-crafted features of the classic activity recognition chain. Recently, a number of contributions have introduced self-supervised learning into the field of HAR, including Multi-task self-supervision, Masked
Reconstruction, CPC, and SimCLR, to name but a few. With the initial success of
these methods, the time has come for a systematic inventory and analysis of the
potential self-supervised learning has for the field. This paper provides
exactly that. We assess the progress of self-supervised HAR research by
introducing a framework that performs a multi-faceted exploration of model
performance. We organize the framework into three dimensions, each containing
three constituent criteria, such that each dimension captures specific aspects
of performance, including the robustness to differing source and target
conditions, the influence of dataset characteristics, and the feature space
characteristics. We utilize this framework to assess seven state-of-the-art self-supervised methods for HAR, formulating insights into the properties of these techniques and establishing their value for learning representations in diverse scenarios.
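For readers unfamiliar with the 'pretrain-then-finetune' paradigm assessed here, the following sketch illustrates it with masked reconstruction as one example pretext task. The architecture, masking ratio, synthetic data, and frozen-encoder evaluation are assumptions made for illustration, not the setup evaluated in the paper.

```python
# Hedged sketch of pretrain-then-finetune with a masked-reconstruction pretext task.
import torch
import torch.nn as nn

windows = torch.randn(256, 1, 100, 3)    # hypothetical accelerometer windows (B, C, T, axes)
labels = torch.randint(0, 6, (256,))     # activity labels, used only during fine-tuning

encoder = nn.Sequential(nn.Flatten(), nn.Linear(300, 128), nn.ReLU())
decoder = nn.Linear(128, 300)            # reconstruction head, discarded after pretraining

# 1) Self-supervised pretraining: reconstruct randomly masked samples (no labels).
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(10):
    mask = (torch.rand_like(windows) > 0.15).float()
    recon = decoder(encoder(windows * mask)).view_as(windows)
    loss = ((recon - windows) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Fine-tuning: a small classifier on top of the (here frozen) encoder,
#    trained with the limited labeled data available in the target scenario.
classifier = nn.Linear(128, 6)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(10):
    logits = classifier(encoder(windows).detach())   # frozen-encoder linear evaluation
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```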
Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition
Human activity recognition (HAR) in wearable computing is typically based on
direct processing of sensor data. Sensor readings are translated into
representations, either derived through dedicated preprocessing, or integrated
into end-to-end learning. Independent of their origin, for the vast majority of
contemporary HAR, those representations are typically continuous in nature.
That has not always been the case. In the early days of HAR, discretization approaches were explored, primarily motivated by the desire to minimize computational requirements, but also with a view toward applications beyond mere recognition, such as activity discovery, fingerprinting, or large-scale
search. Those traditional discretization approaches, however, suffer from substantial loss in precision and resolution in the resulting representations, with detrimental effects on downstream tasks. Times have changed, and in this paper we propose a return to discretized representations. We adopt and apply recent advancements in Vector Quantization (VQ) to wearables applications, which enables us to directly learn a mapping between short spans of sensor data and a codebook of vectors, resulting in recognition performance that is generally on par with, and sometimes surpasses, that of contemporary continuous counterparts. This work therefore presents a proof of concept demonstrating how effective discrete representations can be derived, not only enabling applications beyond mere activity classification but also opening up the field to advanced tools for the analysis of symbolic sequences, as known, for example, from domains such as natural language processing. Based on an
extensive experimental evaluation on a suite of wearables-based benchmark HAR
tasks, we demonstrate the potential of our learned discretization scheme and
discuss how discretized sensor data analysis can lead to substantial changes in
HAR.
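To make the codebook idea concrete, here is a minimal, hypothetical sketch of vector-quantized encoding of short sensor spans in the spirit of VQ-VAE; the codebook size, span length, and use of a straight-through estimator are illustrative choices, not the paper's architecture.

```python
# Hedged sketch: mapping short sensor spans to discrete codebook entries.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=64, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, dim) continuous encodings of short sensor spans
        dists = torch.cdist(z, self.codebook.weight)   # distances to all codes
        idx = dists.argmin(dim=1)                      # one discrete symbol per span
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                   # straight-through gradient trick
        return z_q, idx

encoder = nn.Sequential(nn.Flatten(), nn.Linear(50 * 3, 32))   # hypothetical 50-sample, 3-axis span
vq = VectorQuantizer()

spans = torch.randn(128, 50, 3)                                # toy IMU spans
z_q, symbols = vq(encoder(spans))
print(symbols[:10])   # each span becomes a discrete token, usable like a word in NLP
```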
Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition
To properly assist humans in their needs, human activity recognition (HAR)
systems need the ability to fuse information from multiple modalities. Our hypothesis is that multimodal sensors, visual and non-visual, tend to provide complementary information, each addressing the limitations of the other modalities. In this work, we propose a multi-modal framework that learns to effectively combine features from RGB video and IMU sensors, and we show its robustness on the MMAct and UTD-MHAD datasets. Our model is trained in two stages: in the first stage, each input encoder learns to effectively extract features, and in the second stage, the model learns to combine these individual features. We show
significant improvements of 22% and 11% over the video-only and IMU-only setups on the UTD-MHAD dataset, and of 20% and 12% on the MMAct dataset. Through extensive experimentation, we show the robustness of our model in zero-shot and limited-annotation settings. We further compare with state-of-the-art methods that use more input modalities and show that our method significantly outperforms them on the more difficult MMAct dataset and performs comparably on the UTD-MHAD dataset.
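The two-stage training idea can be sketched as follows; the toy encoders, feature sizes, and synthetic data below are assumptions for illustration and not the authors' architecture.

```python
# Hedged sketch of two-stage multimodal fusion: per-modality encoders first,
# then a fusion head over their concatenated features.
import torch
import torch.nn as nn

video_feats = torch.randn(64, 512)     # hypothetical pre-pooled RGB clip features
imu_feats = torch.randn(64, 6, 100)    # hypothetical 6-channel IMU windows
labels = torch.randint(0, 10, (64,))

video_enc = nn.Linear(512, 128)
imu_enc = nn.Sequential(nn.Flatten(), nn.Linear(600, 128), nn.ReLU())

# Stage 1: each encoder learns to extract features via its own classification head.
for enc, x in [(video_enc, video_feats), (imu_enc, imu_feats)]:
    head = nn.Linear(128, 10)
    opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(10):
        loss = nn.functional.cross_entropy(head(enc(x)), labels)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: keep the encoders fixed and learn how to combine their features.
fusion = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(fusion.parameters(), lr=1e-3)
for _ in range(10):
    fused = torch.cat([video_enc(video_feats), imu_enc(imu_feats)], dim=1).detach()
    loss = nn.functional.cross_entropy(fusion(fused), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```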
The role of representations in human activity recognition
We investigate the role of representations in sensor-based human activity recognition (HAR). In particular, we develop convolutional and recurrent autoencoder architectures for feature learning and compare their performance to a distribution-based representation as well as a supervised deep learning representation based on the DeepConvLSTM architecture. This is motivated by the promises deep learning methods offer: they learn end-to-end, eliminate the necessity for hand-crafting features, and generalize well across tasks and datasets. The choice of studying unsupervised learning methods is motivated by the fact that they afford the possibility of learning meaningful representations without the need for labeled data. Such representations allow for leveraging large, unlabeled datasets for feature and transfer learning. The study is performed on five datasets that are diverse in terms of the number of subjects, activities, and settings. The analysis is performed from a wearables standpoint, considering factors such as memory footprint, the effect of dimensionality, and computation time. We find that the convolutional and recurrent autoencoder-based representations outperform the distribution-based representation on all datasets. Additionally, we conclude that autoencoder-based representations offer performance comparable to the supervised DeepConvLSTM-based representation. On the larger datasets with multiple sensors, such as Opportunity and PAMAP2, the convolutional and recurrent autoencoder-based representations are observed to be highly effective. Resource-constrained scenarios justify the utilization of the distribution-based representation, which has low computational costs and memory requirements. Finally, when the number of sensors is low, we observe that the vanilla autoencoder-based representations produce good performance.
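As a simplified example of the kind of autoencoder-based feature learning compared in this study, the following sketch trains a 1D convolutional autoencoder on synthetic sensor windows; layer sizes, window length, and the pooling step are illustrative assumptions, not the thesis' exact architecture.

```python
# Hedged sketch: unsupervised feature learning with a 1D convolutional autoencoder.
import torch
import torch.nn as nn

windows = torch.randn(128, 3, 100)    # toy 3-axis accelerometer windows of 100 samples

encoder = nn.Sequential(
    nn.Conv1d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),   # -> (128, 16, 50)
    nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),  # -> (128, 32, 25)
)
decoder = nn.Sequential(
    nn.ConvTranspose1d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # -> (128, 16, 50)
    nn.ConvTranspose1d(16, 3, kernel_size=4, stride=2, padding=1),              # -> (128, 3, 100)
)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(20):
    recon = decoder(encoder(windows))
    loss = ((recon - windows) ** 2).mean()   # reconstruction objective, no labels needed
    opt.zero_grad(); loss.backward(); opt.step()

features = encoder(windows).mean(dim=2)      # pooled codes for a downstream classifier
```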